AITopics

Genre: Research Report (0.46)

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.86)

Neural Information Processing SystemsFeb-15-2026, 00:24:22 GMT

Supplement WelQrate: Defining the Gold Standard in Small Molecule Drug Discovery Benchmarking T able of Contents

If taking a closer look at the MedDRA classification on the system organ level on its website, we can find a claim of "System Organ Classes (SOCs) which are groupings by aetiology (e.g. However, as claimed in the original paper, "It should be noted that we did not perform any preprocessing of our datasets, such as Tab. These datasets appear in MoleculeNet as well. As mentioned in the introduction in the main paper, there are also issues with inconsistent representations and undefined stereochemistry. We list an example for each in Figure 1 and Figure 1.

artificial intelligence, compound, machine learning, (15 more...)

Country:

North America > United States > Delaware > New Castle County > Wilmington (0.04)
Europe > Germany > Bavaria > Middle Franconia > Nuremberg (0.04)

Industry:

Health & Medicine > Pharmaceuticals & Biotechnology (1.00)
Health & Medicine > Therapeutic Area > Infections and Infectious Diseases (0.93)
Health & Medicine > Therapeutic Area > Oncology (0.68)
Health & Medicine > Therapeutic Area > Neurology > Alzheimer's Disease (0.46)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.95)

Neural Information Processing SystemsFeb-11-2026, 11:15:43 GMT

ToDD: TopologicalCompoundFingerprintingin Computer-AidedDrugDiscovery

In computer-aided drug discovery (CADD), virtual screening (VS) is used for identifying the drug candidates that are most likely tobind toamolecular target inalargelibraryofcompounds.

artificial intelligence, compound, machine learning, (17 more...)

Country: North America > Canada > Quebec > Montreal (0.04)

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

arXiv.org Artificial IntelligenceDec-5-2025

Fine-Tuning ChemBERTa for Predicting Inhibitory Activity Against TDP1 Using Deep Learning

Zeng, Baichuan

Predicting the inhibitory potency of small molecules against Tyrosyl-DNA Phosphodiesterase 1 (TDP1) -- a key target in overcoming cancer chemoresistance--remains a critical challenge in early drug discovery. We present a deep learning framework for the quantitative regression of pIC50 values from molecular Simplified Molecular Input Line Entry System (SMILES) strings using fine-tuned variants of ChemBERTa, a pre-trained chemical language model. Leveraging a large-scale consensus dataset of 177,092 compounds, we systematically evaluate two pre-training strategies--Masked Language Modeling (MLM) and Masked Token Regression (MTR)--under stratified data splits and sample weighting to address severe activity imbalance which only 2.1% are active. Our approach outperforms classical baselines Random Predictor in both regression accuracy and virtual screening utility, and has competitive performance compared to Random Forest, achieving high enrichment factor EF@1% 17.4 and precision Precision@1% 37.4 among top-ranked predictions. The resulting model, validated through rigorous ablation and hyperparameter studies, provides a robust, ready-to-deploy tool for prioritizing TDP1 inhibitors for experimental testing. By enabling accurate, 3D-structure-free pIC50 prediction directly from SMILES, this work demonstrates the transformative potential of chemical transformers in accelerating target-specific drug discovery.

artificial intelligence, machine learning, sample weighting, (18 more...)

2512.04252

Genre: Research Report > Experimental Study (0.66)

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Neural Information Processing SystemsOct-10-2025, 04:08:08 GMT

Supplement WelQrate: Defining the Gold Standard in Small Molecule Drug Discovery Benchmarking T able of Contents

active compound, compound, dataset, (13 more...)

Country:

North America > United States > Delaware > New Castle County > Wilmington (0.04)
Europe > Germany > Bavaria > Middle Franconia > Nuremberg (0.04)

Industry:

Health & Medicine > Pharmaceuticals & Biotechnology (1.00)
Health & Medicine > Therapeutic Area > Infections and Infectious Diseases (0.93)
Health & Medicine > Therapeutic Area > Oncology (0.68)
Health & Medicine > Therapeutic Area > Neurology > Alzheimer's Disease (0.46)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.95)

arXiv.org Artificial IntelligenceNov-14-2024

WelQrate: Defining the Gold Standard in Small Molecule Drug Discovery Benchmarking

Liu, Yunchao, Dong, Ha, Wang, Xin, Moretti, Rocco, Wang, Yu, Su, Zhaoqian, Gu, Jiawei, Bodenheimer, Bobby, Weaver, Charles David, Meiler, Jens, Derr, Tyler

While deep learning has revolutionized computer-aided drug discovery, the AI community has predominantly focused on model innovation and placed less emphasis on establishing best benchmarking practices. We posit that without a sound model evaluation framework, the AI community's efforts cannot reach their full potential, thereby slowing the progress and transfer of innovation into real-world drug discovery. Thus, in this paper, we seek to establish a new gold standard for small molecule drug discovery benchmarking, WelQrate. Specifically, our contributions are threefold: WelQrate Dataset Collection - we introduce a meticulously curated collection of 9 datasets spanning 5 therapeutic target classes. Our hierarchical curation pipelines, designed by drug discovery experts, go beyond the primary high-throughput screen by leveraging additional confirmatory and counter screens along with rigorous domain-driven preprocessing, such as Pan-Assay Interference Compounds (PAINS) filtering, to ensure the high-quality data in the datasets; WelQrate Evaluation Framework - we propose a standardized model evaluation framework considering high-quality datasets, featurization, 3D conformation generation, evaluation metrics, and data splits, which provides a reliable benchmarking for drug discovery experts conducting real-world virtual screening; Benchmarking - we evaluate model performance through various research questions using the WelQrate dataset collection, exploring the effects of different models, dataset quality, featurization methods, and data splitting strategies on the results. In summary, we recommend adopting our proposed WelQrate as the gold standard in small molecule drug discovery benchmarking. The WelQrate dataset collection, along with the curation codes, and experimental scripts are all publicly available at WelQrate.org.

artificial intelligence, deep learning, machine learning, (16 more...)

2411.0982

Country:

North America > United States > New York > New York County > New York City (0.04)
North America > United States > Delaware > New Castle County > Wilmington (0.04)
Europe > Germany > Bavaria > Middle Franconia > Nuremberg (0.04)
(2 more...)

Genre: Research Report > New Finding (0.48)

Industry:

Health & Medicine > Pharmaceuticals & Biotechnology (1.00)
Health & Medicine > Therapeutic Area > Oncology (0.93)
Health & Medicine > Therapeutic Area > Infections and Infectious Diseases (0.93)
Health & Medicine > Therapeutic Area > Neurology > Alzheimer's Disease (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.94)

arXiv.org Artificial IntelligenceDec-11-2023

Generation of 3D Molecules in Pockets via Language Model

Feng, Wei, Wang, Lvwei, Lin, Zaiyun, Zhu, Yanhao, Wang, Han, Dong, Jianqiang, Bai, Rong, Wang, Huting, Zhou, Jielong, Peng, Wei, Huang, Bo, Zhou, Wenbiao

Generative models for molecules based on sequential line notation (e.g. SMILES) or graph representation have attracted an increasing interest in the field of structure-based drug design, but they struggle to capture important 3D spatial interactions and often produce undesirable molecular structures. To address these challenges, we introduce Lingo3DMol, a pocket-based 3D molecule generation method that combines language models and geometric deep learning technology. A new molecular representation, fragment-based SMILES with local and global coordinates, was developed to assist the model in learning molecular topologies and atomic spatial positions. Additionally, we trained a separate noncovalent interaction predictor to provide essential binding pattern information for the generative model. Lingo3DMol can efficiently traverse drug-like chemical spaces, preventing the formation of unusual structures. The Directory of Useful Decoys-Enhanced (DUD-E) dataset was used for evaluation. Lingo3DMol outperformed state-of-the-art methods in terms of drug-likeness, synthetic accessibility, pocket binding mode, and molecule generation speed.

atom, lingo3dmol, molecule, (16 more...)

2305.10133

Country:

Asia > China > Beijing > Beijing (0.04)
Asia > China > Guangdong Province > Guangzhou (0.04)
Europe > Germany > Rheinland-Pfalz > Mainz (0.04)

Genre:

Research Report > New Finding (0.46)
Research Report > Promising Solution (0.34)

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Warszycki, Dawid, Struski, Łukasz, Śmieja, Marek, Kafel, Rafał, Kurczab, Rafał

Pharmacoprint -- a combination of pharmacophore fingerprint and artificial intelligence as a tool for computer-aided drug design

arXiv.org Artificial IntelligenceOct-31-2023

Structural fingerprints and pharmacophore modeling are methodologies that have been used for at least two decades in various fields of cheminformatics: from similarity searching to machine learning (ML). Advances in silico techniques consequently led to combining both these methodologies into a new approach known as pharmacophore fingerprint. Herein, we propose a high-resolution, pharmacophore fingerprint called Pharmacoprint that encodes the presence, types, and relationships between pharmacophore features of a molecule. Pharmacoprint was evaluated in classification experiments by using ML algorithms (logistic regression, support vector machines, linear support vector machines, and neural networks) and outperformed other popular molecular fingerprints (i.e., Estate, MACCS, PubChem, Substructure, Klekotha-Roth, CDK, Extended, and GraphOnly) and ChemAxon Pharmacophoric Features fingerprint. Pharmacoprint consisted of 39973 bits; several methods were applied for dimensionality reduction, and the best algorithm not only reduced the length of bit string but also improved the efficiency of ML tests. Further optimization allowed us to define the best parameter settings for using Pharmacoprint in discrimination tests and for maximizing statistical parameters. Finally, Pharmacoprint generated for 3D structures with defined hydrogens as input data was applied to neural networks with a supervised autoencoder for selecting the most important bits and allowed to maximize Matthews Correlation Coefficient up to 0.962. The results show the potential of Pharmacoprint as a new, perspective tool for computer-aided drug design.

fingerprint, pharmacoprint, representation, (15 more...)

doi: 10.1021/acs.jcim.1c00589

2110.01339

Country: Europe > Poland > Lesser Poland Province > Kraków (0.14)

Genre: Research Report > New Finding (0.69)

Industry:

Health & Medicine > Pharmaceuticals & Biotechnology (1.00)
Health & Medicine > Therapeutic Area > Infections and Infectious Diseases (0.48)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Support Vector Machines (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.67)

arXiv.org Artificial IntelligenceAug-14-2022

Widely Used and Fast De Novo Drug Design by a Protein Sequence-Based Reinforcement Learning Model

Li, Yaqin, Li, Lingli, Xu, Yongjin, Yu, Yi

De novo molecular design has facilitated the exploration of large chemical space to accelerate drug discovery. Structure-based de novo method can overcome the data scarcity of active ligands by incorporating drug-target interaction into deep generative architectures. However, these strategies are bottlenecked by the small fraction of experimentally determined protein or complex structures. In addition, the cost of molecular generation is computationally expensive due to 3D representations of both molecule and protein. Here, we demonstrate a widely used and fast protein sequence-based reinforcement learning (RL) model for drug discovery. In the generative model, one of the reward components, a binding affinity predictor, is based on 1D protein sequence and molecular SMILES. As a proof of concept, the RL model was utilized to design molecules for four targets. The generated compounds showed bioactivities by the validation of both QSAR and molecular docking with experimental 3D binding pockets. We also found that the performance of generated molecules depends on the selection of data source training for the binding predictor. Furthermore, drug design for a kinase without any experimental structure, CDK20, was studied by our model. With only 1D protein sequence as input, the generated novel compounds showed favorable binding affinity based on the AlphaFold predicted structure.

artificial intelligence, machine learning, molecule, (18 more...)

2209.07405

Country:

Europe > Sweden > Vaestra Goetaland > Gothenburg (0.05)
Asia > China > Zhejiang Province > Hangzhou (0.04)
Asia > China > Sichuan Province > Chengdu (0.04)

Genre: Research Report (0.40)

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

Dey, Vishal, Machiraju, Raghu, Ning, Xia

Improving Compound Activity Classification via Deep Transfer and Representation Learning

arXiv.org Artificial IntelligenceNov-14-2021

Recent advances in molecular machine learning, especially deep neural networks such as Graph Neural Networks (GNNs) for predicting structure activity relationships (SAR) have shown tremendous potential in computer-aided drug discovery. However, the applicability of such deep neural networks are limited by the requirement of large amounts of training data. In order to cope with limited training data for a target task, transfer learning for SAR modeling has been recently adopted to leverage information from data of related tasks. In this work, in contrast to the popular parameter-based transfer learning such as pretraining, we develop novel deep transfer learning methods TAc and TAc-fc to leverage source domain data and transfer useful information to the target domain. TAc learns to generate effective molecular features that can generalize well from one domain to another, and increase the classification performance in the target domain. Additionally, TAc-fc extends TAc by incorporating novel components to selectively learn feature-wise and compound-wise transferability. We used the bioassay screening data from PubChem, and identified 120 pairs of bioassays such that the active compounds in each pair are more similar to each other compared to its inactive compounds. Overall, TAc achieves the best performance with average ROC-AUC of 0.801; it significantly improves ROC-AUC of 83% target tasks with average task-wise performance improvement of 7.102%, compared to the best baseline FCN-dmpna (DT). Our experiments clearly demonstrate that TAc achieves significant improvement over all baselines across a large number of target tasks. Furthermore, although TAc-fc achieves slightly worse ROC-AUC on average compared to TAc (0.798 vs 0.801), TAc-fc still achieves the best performance on more tasks in terms of PR-AUC and F1 compared to other methods.

bioassay, compound, target task, (16 more...)